This session will give hands-on feel on conducting a bioinformatics analysis
We will cover the following sections ;
Cancer is a disease in which some of the body’s cells grow uncontrollably and spread to other parts of the body. Cancer is caused by certain changes to genes.
What are the differentially expressed genes between normal cells and cancer cells? What do we mean by differentially expressed?
Key Point: We want to be clear what analysis are we conducting.
What is R
Introduction to RStudio interface
Addition
3 + 3
## [1] 6
Multiplication
3 * 3
## [1] 9
Storing variables in R
num1 <- 5
num2 = 10
num1 + num2
## [1] 15
A more practical example. Lets create a vector storing multiple values
#create vectors to hold plant heights from each sample
group1 <- c(8, 8, 9, 9, 9, 11, 12, 13, 13, 14)
group2 <- c(22, 23, 24, 24, 25, 26, 27, 20, 26, 28)
Lets get the sum
8 + 8 + 9 + 9 + 9 + 11 + 12 + 13 + 13 + 14
## [1] 106
Now calculate the average (mean)
(8 + 8 + 9 + 9 + 9 + 11 + 12 + 13 + 13 + 14) / 10
## [1] 10.6
Key Point: R has built in function that we can use to perform mathematical / statistical operations.
Using the sum function to calculate the sum
sum(group1)
## [1] 106
Using the mean function to calculate the average
(mean)
mean(group1)
## [1] 10.6
Key Point: R has built in function to quickly visualize data.
Given our previous example of group1 and
group2, we can display the data as a barplot.
# Calculate means
means <- c(mean(group1), mean(group2))
# Make bar plot
barplot(means, names.arg = c("Group 1", "Group 2"),
col = c("skyblue", "salmon"),
main = "Average Plant Height",
ylab = "Mean Height")
Other forms of vizulaisation using R
Key Point: R has built in function to perform statistical test.
A t-test is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine whether two groups are different from one another.
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]
where:
- \(\bar{x}_1, \bar{x}_2\) are the
sample means
- \(s_1^2, s_2^2\) are the sample
variances
- \(n_1, n_2\) are the sample sizes
Variance is simply the spread of the data
df <- data.frame(
value = c(group1, group2),
group = c(rep("group1", length(group1)), rep("group2", length(group2)))
)
# Calculate means and variances
means <- tapply(df$value, df$group, mean)
vars <- tapply(df$value, df$group, var)
# Data for variance boxes
rects <- data.frame(
group = c("group1", "group2"),
xmin = c(0.7, 1.7),
xmax = c(1.3, 2.3),
ymin = means - vars/2,
ymax = means + vars/2,
mean = means,
var = vars
)
ggplot(df, aes(x = group, y = value, color = group)) +
geom_jitter(width = 0.1, size = 3, alpha = 0.7) +
# Variance box
geom_rect(data = rects, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
fill = NA, color = "red", linetype = "dashed", inherit.aes = FALSE) +
# Variance annotation
geom_text(data = rects, aes(x = group, y = ymax + 10,
label = paste0("Variance = ", round(var, 2))),
color = "red", size = 4, inherit.aes = FALSE) +
# Mean point
geom_point(data = rects, aes(x = group, y = mean), color = "black", size = 4, inherit.aes = FALSE) +
labs(title = "Variance Visualized as a Box",
y = "Value") +
theme_bw()
Lets return to our example of group1 and
group2. We want to test if the two values are statistically
/ significantly different. The t-test is performed using the
t.test() function, which essentially tests for the
difference in means of a variable between two groups.
t.test(group1, group2)
##
## Welch Two Sample t-test
##
## data: group1 and group2
## t = -13.26, df = 17.932, p-value = 1.048e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -16.10295 -11.69705
## sample estimates:
## mean of x mean of y
## 10.6 24.5
t.test saves a lot of information: the difference in
means estimate, confidence interval for the difference conf.int, the
p-value p.value, etc.
Without using the build in function, it will look something like this ;
# Data
group1 <- c(8, 8, 9, 9, 9, 11, 12, 13, 13, 14)
group2 <- c(22, 23, 24, 24, 25, 26, 27, 20, 26, 28)
# Calculate means
mean1 <- mean(group1)
mean2 <- mean(group2)
# Calculate sample variances
var1 <- var(group1) # s1^2
var2 <- var(group2) # s2^2
# Sample sizes
n1 <- length(group1)
n2 <- length(group2)
# Standard error using variances
se <- sqrt((var1 / n1) + (var2 / n2))
# t-statistic
t_stat <- (mean1 - mean2) / se
# Degrees of freedom (Welch's)
df <- ( (var1/n1 + var2/n2)^2 ) /
( ((var1/n1)^2)/(n1-1) + ((var2/n2)^2)/(n2-1) )
# p-value
p_value <- 2 * pt(-abs(t_stat), df)
# Print results
cat("p-value:", p_value, "\n")
## p-value: 1.048483e-10
Load required package. By default, R comes with some basic tools
(functions), but for tasks (like RNA-seq analysis), you need extra
tools. These extra tools are called packages. Loading a package (with
library()) means you are telling R, “Please get these extra
tools ready so I can use them in my analysis.”
library(DESeq2)
library(EnhancedVolcano)
library(dplyr)
library(scales)
library(pheatmap)
Lets read the actual data from an experiment comparing a tumor and normal RNA-seq experiment.
First we read the in the sample data and look at the files.
df.meta <- read.csv("../output/sample_data.csv")
Lets have a look at the data.
df.meta
## Run group
## 1 SRR975577 normal
## 2 SRR975581 normal
## 3 SRR975584 normal
## 4 SRR975571 normal
## 5 SRR975570 normal
## 6 SRR975578 normal
## 7 SRR975569 normal
## 8 SRR975576 normal
## 9 SRR975575 normal
## 10 SRR975572 normal
## 11 SRR975573 normal
## 12 SRR975574 normal
## 13 SRR975559 tumor
## 14 SRR975555 tumor
## 15 SRR975565 tumor
## 16 SRR975564 tumor
## 17 SRR975562 tumor
## 18 SRR975552 tumor
## 19 SRR975567 tumor
## 20 SRR975566 tumor
## 21 SRR975558 tumor
## 22 SRR975554 tumor
## 23 SRR975551 tumor
## 24 SRR975568 tumor
Check the number of samples (rows) in the data
nrow(df.meta)
## [1] 24
Check the number of tumor and normal samples
table(df.meta$group)
##
## normal tumor
## 12 12
Now we read into R the RNA-seq data. Lets look at the data first 3 rows and 5 columns. Each row is a gene and each column is a sample.
raw.counts <- read.csv("../output/raw_counts.csv", row.names = 1)
raw.counts[1:3, 1:5]
## SRR975577 SRR975581 SRR975584 SRR975571 SRR975570
## TSPAN6 1309 674 981 2271 691
## TNMD 19 5 10 20 4
## DPM1 389 268 419 743 278
How is the data structured? - Each row is a gene and each column is a sample.
What are these numbers? - They are the number of reads that mapped to each gene.
Key Point: Understand the data structure: genes in rows, samples in columns.
Lets give the example of one gene. We would like to perform a
statistical test (t.test) to determine if the expression of
the gene is different in the normal and
tumor samples.
We know the first 12 samples are normal and the next 12 are tumors.
# Extract counts for the first gene as a numeric vector
gene_counts <- as.numeric(raw.counts[1, ])
# Create vectors for normal and tumor samples
normal_counts <- gene_counts[1:12]
tumor_counts <- gene_counts[13:24]
# Perform t-test
t_test_result <- t.test(normal_counts, tumor_counts)
# Print results
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: normal_counts and tumor_counts
## t = -2.3454, df = 14.504, p-value = 0.03369
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2416.584 -111.916
## sample estimates:
## mean of x mean of y
## 1380.083 2644.333
It may be helpful to also visualize the gene expression to understand what is going on.
# Assume df.meta$group contains "normal" and "tumor"
group_colors <- c(rep("skyblue", 12), rep("salmon", 12))
barplot(gene_counts,
col = group_colors,
main = "Counts for Gene 1",
ylab = "Counts")
We will have to do 20,000 t-tests. This is not practical. And there are other considerations that needs to be accounted for, for example normalization..
Lets now perform a differential expression analysis between tumor and normal samples.
One tools that we can use is DESeq2. This was designed by Michael Love, Wolfgang Huber and Simon Anders (bioinformatics group at EMBL Heidelberg, Germany).
Its starts by creating a DESeq2 object from the count data and the sample information.
dds <- DESeqDataSetFromMatrix(countData = raw.counts,
colData = df.meta,
design = ~ group)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
We can examine what is inside the DESeqDataSet object.
dds
## class: DESeqDataSet
## dim: 18674 24
## metadata(1): version
## assays(1): counts
## rownames(18674): TSPAN6 TNMD ... BLACAT1 GIMAP1-GIMAP5
## rowData names(0):
## colnames(24): SRR975577 SRR975581 ... SRR975551 SRR975568
## colData names(2): Run group
Next we estimate the size factors and normalize the data. This is to account for the differences in sequencing depth between samples.
dds <- estimateSizeFactors(dds)
vsd <- vst(dds)
What the steps above does is that it divides each sample’s sequencing count data by the size factors (total sequencing reads) to normalize the data. We can visualize the comparison below ;
# Calculate raw library sizes
library_sizes_raw <- colSums(raw.counts)
# Choose a target library size (median or mean)
target_libsize <- median(library_sizes_raw)
# target_libsize <- mean(library_sizes_raw) # you can use mean if you prefer
# Normalize each sample so all columns sum to target_libsize
norm_counts_demo <- sweep(raw.counts, 2, library_sizes_raw, "/") * target_libsize
library_sizes_norm <- colSums(norm_counts_demo)
# Prepare data for plotting
df_libsize <- data.frame(
Sample = rep(colnames(raw.counts), 2),
LibrarySize = c(library_sizes_raw, library_sizes_norm),
Group = rep(df.meta$group, 2),
Stage = rep(c("Raw counts", "Normalized"), each = ncol(raw.counts))
) %>%
mutate(Stage = factor(Stage, levels=c("Raw counts", "Normalized")))
library(ggplot2)
ggplot(df_libsize, aes(x = Sample, y = LibrarySize, fill = Group)) +
geom_bar(stat = "identity") +
facet_wrap(~ Stage, ncol = 2, ) +
labs(title = "Before and After Normalization",
y = "Total Sequencing Read Counts",
x = "Sample") +
scale_fill_manual(values = c("normal" = "skyblue", "tumor" = "salmon")) +
theme_bw() +
theme(panel.grid = element_blank(),
axis.text.x = element_blank()) +
scale_y_continuous(label=comma) +
NULL
Once the data is normalized, we can vizualize the samples on a Principal Component Analysis (PCA) plot.
df_pca <- plotPCA(vsd, intgroup="group", returnData = TRUE)
## using ntop=500 top features by variance
plotPCA(vsd, intgroup="group") + scale_color_manual(values=c("skyblue", "salmon"))
## using ntop=500 top features by variance
A PCA plot is like a 2-D map that shows you how similar or different your samples are, using most (or all) of the gene expression data at once.
Image taken from StatQuest
We use the function DESeq() to perform the analysis.
dds <- DESeq(dds)
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 373 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
The key steps from the function above:
Key Point: In summary, these steps are all about making sure DESeq2 understands both the average and the variability of each gene’s counts, so it can accurately tell which genes are truly different between your groups (e.g., tumor vs normal), and which are just noisy.
We can extract the results of DESeq2 comparison between the tumor vs normal group for each gene.
res <- results(dds)
The results are stored in the res object. We can view
the top differentially expressed genes.
head(res[order(res$pvalue), ])
## log2 fold change (MLE): group tumor vs normal
## Wald test p-value: group tumor vs normal
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## INHBA 453.212 6.26699 0.295228 21.2276 5.30617e-100 9.02050e-96
## STC2 250.030 5.03325 0.295610 17.0267 5.20987e-65 4.42839e-61
## ETV4 728.103 5.08567 0.306103 16.6143 5.49470e-62 3.11367e-58
## CPNE7 286.688 5.45681 0.331144 16.4787 5.22186e-61 2.21929e-57
## CDH3 655.717 5.85670 0.361411 16.2051 4.64210e-59 1.57831e-55
## CGREF1 164.795 3.94658 0.253266 15.5828 9.53524e-55 2.70165e-51
baseMean - Is the average (mean) normalized expression
of a gene across all sampleslog2FoldChange - The estimated difference in expression
between the two groups, on a log2 scale.
lfcSE - The standard error (uncertainty) of the log2
fold change estimate.
stat - The test statistic (usually the Wald statistic),
calculated as log2FoldChange divided by lfcSE.
pvalue - The probability of seeing such a big
difference (or bigger) just by chance, if there’s really no
difference.padj - The p-value adjusted for multiple testing.
We can visualize the differential expression result of all
the genes in a plot called the VolcanoPlot. We use the
function EnhancedVolcano() to generate the plot.
EnhancedVolcano(res,
lab = rownames(res),
x = 'log2FoldChange',
y = 'pvalue',
title = 'Volcano plot')
Lets interpret the plot ;
The plot looks like a volcano because most genes don’t change much (clustered in the middle), but a few genes shoot up on the sides—these are the most interesting ones!
We can visualize the differential expression result of
selected significant genes as a heatmap. We
use the function pheatmap() to generate the plot.